BMC Bioinformatics
Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match the content profile of BMC Bioinformatics, based on 383 papers previously published here. The average preprint has a 0.37% match score for this journal, so anything above that is an above-average fit.
Vilain, M.; Aris-Brosou, S.
Background: The ever-growing amount of available biological data means that modern analyses are performed on large datasets. Unfortunately, bioinformatics tools for preprocessing and analyzing data are not always designed to handle such volumes efficiently. Notably, this is the case when encoding DNA and RNA sequences into numerical representations, also called descriptors, before passing them to machine learning models. Furthermore, the Python tools currently available for this preprocessing step are not well suited to integration into pipelines, resulting in slow encoding speeds. Results: We introduce dna-parser, a Python library written in Rust for encoding DNA and RNA sequences into numerical features. The combination of Rust and Python makes it possible to encode sequences rapidly and in parallel across multiple threads while maintaining compatibility with packages from the Python ecosystem. Moreover, the library implements many of the most widely used numerical feature schemes from bioinformatics and natural language processing. Conclusion: dna-parser is an easy-to-install Python library that ships wheels for Linux (musllinux and manylinux), macOS, and Windows via pip (https://pypi.org/project/dna-parser/). The open-source code is available on GitHub (https://github.com/Mvila035/dna_parser) along with the documentation (https://mvila035.github.io/dna_parser/documentation/).
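As a concrete illustration of the encoding step this abstract describes (this is a generic sketch, not the dna-parser API), one of the simplest descriptors is a one-hot encoding of each base into a numerical vector:

```python
# Illustrative sketch: one-hot encoding of a DNA string into a flat
# 0/1 feature vector (4 indicator values per base), the kind of
# descriptor passed to machine learning models.

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat list of 0/1 features."""
    vec = []
    for base in seq.upper():
        row = [0] * 4
        if base in BASES:          # ambiguous bases (e.g. N) stay all-zero
            row[BASES.index(base)] = 1
        vec.extend(row)
    return vec
```

Libraries like the one described here implement many such schemes (k-mer counts, word embeddings, etc.) with the hot loop in a compiled language; the Python-level interface stays the same shape as this toy version.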
Wang, Z.; Sudlow, L. C.; Du, J.; Berezin, M. Y.
Background: Gene Ontology (GO) enrichment analysis is a widely used approach for interpreting high-throughput transcriptomic and genomic data. However, conventional GO over-representation analyses typically yield long, redundant lists of enriched terms that are difficult to relate to biological questions and to the most relevant biological pathways. Results: We present thematicGO, a customizable framework that organizes enriched GO terms into biological themes using a curated keyword-based matching strategy. In this approach, GO enrichment of differentially expressed genes is performed through the g:Profiler Application Programming Interface (API), followed by aggregation, within each theme, of the scores of its contributing GO terms. Side-by-side comparison with conventional GO annotation workflows demonstrates that thematicGO captures related biological outcomes while substantially reducing redundancy and improving readability. To enhance accessibility, we implemented an interactive, web-deployed graphical user interface (GUI) that enables users to upload gene lists and explore thematic enrichment results. Conclusion: thematicGO simplifies functional enrichment analysis by bridging the gap between granular GO term outputs and higher-level biological interpretation, which can be especially useful for RNA-seq studies that identify differentially expressed genes. The approach complements standard GO enrichment with transparent, theme-based aggregation and comparison against classical GO annotation workflows. thematicGO provides an easy-to-use, understandable, and reproducible tool for transcriptomic studies, particularly those involving RNA-seq data and complex biological responses.
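The keyword-based theme aggregation described above can be sketched as follows (the theme names, keywords, and summing rule here are invented for illustration, not taken from thematicGO):

```python
# Hypothetical sketch of keyword-based theme matching: enriched GO
# terms are grouped into curated themes by keyword, and each theme
# aggregates the scores of its contributing terms.

THEMES = {
    "immune response": ["immune", "cytokine", "inflammatory"],
    "cell cycle": ["mitotic", "cell cycle", "division"],
}

def aggregate_themes(enriched_terms):
    """enriched_terms: list of (GO term name, enrichment score) pairs."""
    scores = {theme: 0.0 for theme in THEMES}
    for name, score in enriched_terms:
        for theme, keywords in THEMES.items():
            if any(kw in name.lower() for kw in keywords):
                scores[theme] += score   # sum contributions within a theme
    return scores
```

A dozen redundant immune-related GO terms thus collapse into a single "immune response" theme score, which is the readability gain the abstract describes.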
Grether, V.; Goldstein, Z. R.; Shelton, J. M.; Chu, T. R.; Hooper, W. F.; Geiger, H.; Corvelo, A.; Martini, R.; Davis, M. B.; Robine, N.; Liao, W.
Background: Formalin fixation and paraffin embedding (FFPE) is a widely used, cost-effective method for long-term storage of clinical samples. However, fixation is known to damage nucleic acids, which can present as artifactual bases in sequencing that are absent from higher-fidelity storage methods such as fresh freezing (FF). Various machine learning methods exist for filtering these variant artifacts, but benchmarking their performance is difficult without reliable truth sets. In this study, we employ a collection of 90 paired fresh-frozen and formalin-fixed paraffin-embedded samples from the same tumors to robustly define real and FFPE-derived artifactual variation and to enable objective evaluation of filtering methods. To address existing shortcomings, we propose a novel explainable boosting machine (EBM) model that improves performance, can be easily updated with new data, requires modest computational resources, and is analysis-pipeline agnostic, making it broadly accessible. Results: We evaluated several methods for limiting FFPE-derived variant artifacts using cohorts of B-cell lymphoma samples. We found that capturing the local context around variants is a highly informative, under-utilized feature set not commonly incorporated into existing machine learning methods. Consequently, we developed a novel algorithm, FIFA, for filtering FFPE artifacts, which uses an EBM model, an interpretable decision-tree-based learning algorithm, to address some of these shortcomings. We used four independent cohorts composed of paired lymphoma and cervical cancer samples and a breast cancer cell line with both FF and FFPE samples to define clearly annotated training and test sets, and demonstrated improved performance over existing methods. Additionally, FIFA filtering increased relevant biological signals in FFPE breast cancer datasets distinct from the training and testing sets.
The EBM framework employed by FIFA is computationally efficient, and its generalized additive modeling of features makes it straightforward to incorporate new datasets into existing models over time. Conclusions: Our FFPE variant artifact filtering tool, FIFA, is a marked improvement over existing methods. It can be easily implemented post hoc to supplement existing somatic calling pipelines, training and inference run quickly across most compute environments, and it can be updated online as new training data become available. Accordingly, FIFA represents an important advance for retrospective cancer genomics research, further enhancing access to the vast stores of FFPE-archived tumor samples currently in existence.
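The "local context around variants" feature the authors highlight can be sketched in its simplest form: extract the reference bases flanking a variant position as a categorical feature (the function and example sequence below are illustrative, not FIFA's implementation):

```python
# Illustrative sketch of a local-context feature: the trinucleotide
# centred on a variant site. FFPE deamination artifacts (C>T) tend to
# be enriched in particular sequence contexts, which is why this is an
# informative feature for artifact filtering.

def context(reference, pos, flank=1):
    """Return the reference bases within `flank` of position `pos`."""
    return reference[max(0, pos - flank): pos + flank + 1]
```

A model then treats each observed context (e.g. "ACG" for a C>T call at a given site) as one more input feature alongside depth and quality metrics.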
Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.
Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl GeneIDs, Entrez GeneIDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it
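The core identifier-conversion problem such packages solve can be sketched in a few lines (the mapping table below is a tiny hand-made example, not the package's data; the Entrez IDs for TP53, BRCA1, and EGFR are their real values):

```python
# Minimal sketch of gene identifier conversion: map gene symbols to
# Entrez GeneIDs, preserving unmapped symbols for inspection rather
# than silently dropping them -- the "missing mappings" problem the
# abstract mentions.

SYMBOL_TO_ENTREZ = {"TP53": "7157", "BRCA1": "672", "EGFR": "1956"}

def convert(symbols, table=SYMBOL_TO_ENTREZ):
    mapped, missing = {}, []
    for sym in symbols:
        if sym in table:
            mapped[sym] = table[sym]
        else:
            missing.append(sym)
    return mapped, missing
```

Returning the unmapped list explicitly is what lets a pipeline report coverage instead of losing genes between steps.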
Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.
Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analysis of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines; among them, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we used Nextflow to re-design an existing pipeline for the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, MOAflow, is a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. Data from the original article were used to benchmark the new pipeline; its outputs closely match those of the original study, with minor variations. Conclusions: MOAflow demonstrates how adopting a robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity, highlighting the value of modern WMS-driven approaches in meeting the computational demands of large-scale genomics.
Neubrand, N.; Rachel, T.; Litwin, T.; Timmer, J.; Kreutz, C.; Hess, M.
Motivation: Systems biology strives to unravel the complex dynamics of cellular processes, often with the help of ordinary differential equations (ODEs). However, the sparsity of measured data and the strong non-linearity of common ODEs introduce severe numerical problems in typical modeling tasks. This has given rise to many computational algorithms that must be systematically evaluated to ensure optimal method choices. Currently, the number of well-curated models available for such benchmarking is insufficient, as building and calibrating biologically reasonable models from experiments requires years of work. Results: We present a large-scale collection of 1100 synthetic modeling problems, generated from the ODE systems and experimental designs of 22 published modeling problems. This is achieved by extending a recent method for simulating time-course data for randomly generated observation functions to also include realistic measurement patterns across multiple experimental conditions. By analyzing data and model characteristics, optimization performance, and parameter identifiability, we show that the synthetic problems provide a realistic and diverse extension of the existing problem space. The synthetic collection therefore provides a valuable resource for benchmarking in dynamic modeling. Availability and Implementation: Benchmark problems and algorithms are publicly available at https://github.com/niklasneubrand/1100SyntheticBenchmarksODE and https://zenodo.org/records/14008247.
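The general recipe of generating a synthetic modeling problem can be sketched with a toy system (the decay ODE, step sizes, noise level, and measurement pattern below are illustrative, not the paper's generator): simulate an ODE with known parameters, then sample noisy observations at a sparse set of time points.

```python
# Sketch: simulate dx/dt = -k*x with forward Euler, then generate
# sparse, noisy observations -- the two ingredients of a synthetic
# benchmark problem with known ground-truth parameters.
import random

def simulate(k=0.5, x0=1.0, dt=0.01, steps=500):
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * (-k * xs[-1]))   # forward Euler step
    return xs

def observe(xs, indices, sigma=0.05, seed=0):
    """Sample noisy observations at a sparse measurement pattern."""
    rng = random.Random(seed)
    return [xs[i] + rng.gauss(0.0, sigma) for i in indices]

trajectory = simulate()
data = observe(trajectory, indices=[0, 100, 250, 500])
```

Because the true k is known, any calibration algorithm run on `data` can be scored objectively, which is exactly what makes synthetic collections useful for benchmarking.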
Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.
Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools (ANNOVAR, SnpEff, and VEP), using both Ensembl and RefSeq gene models, to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and the gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.
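The integration strategy reduces to set operations over per-tool annotation results. The tool names below are the real ones from the study, but the toy SNP sets are invented for illustration:

```python
# Sketch of multi-tool integration: the union recovers SNPs missed by
# any single tool/gene-model combination, while the intersection gives
# a high-confidence consensus.

annotations = {
    "ANNOVAR": {"rs1", "rs2", "rs4"},
    "SnpEff":  {"rs1", "rs2", "rs3", "rs4"},
    "VEP":     {"rs2", "rs3", "rs5"},
}

union = set().union(*annotations.values())          # maximal coverage
consensus = set.intersection(*annotations.values()) # agreed by all tools

# coverage gained over the single best tool:
best_single = max(annotations.values(), key=len)
gain = union - best_single
```

Even in this toy example the best single tool misses an annotation (`rs5`) that the union recovers, mirroring the paper's finding that no single configuration achieved full recovery of the union reference.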
Gong, C.; Yang, Q.; Wan, R.; Li, S.; Zhang, Y.; Li, Y.
Background: Joint variant calling is a crucial step in population-scale sequencing analysis. While population-scale sequencing is a powerful tool for genetic studies, achieving fast and accurate joint variant calling on large cohorts remains computationally challenging. Findings: To meet this challenge, we developed the Distributed Population Genetics Tool (DPGT), an efficient computing framework and robust tool for joint variant calling on large cohorts, built on Apache Spark. DPGT reduces joint calling on large cohorts to a single command on a local computer or computing cluster, eliminating the need for users to create complex parallel workflows. We evaluated DPGT against existing methods using 2,504 1000 Genomes Project (1KGP), 6 Genome in a Bottle (GIAB), and 9,158 internal whole-genome sequencing (WGS) samples. DPGT produced results comparable in accuracy to existing methods, in less time and with better scalability. Conclusions: DPGT is a fast, scalable, and accurate tool for joint variant calling. The source code, implemented in Java and C++, is available under a GPLv3 license at https://github.com/BGI-flexlab/DPGT.
Andrews, B.; Ranganathan, R.
Motivation: DNA barcodes are commonly used to distinguish genuine mutations from sequencing errors in sequencing-based assays. In the presence of indel errors, using barcodes requires accurate alignment of the raw reads to distinguish genuine indels from indel errors. Existing strategies generally rely on aligners built for homology comparison and do not fully utilize quality scores. We reasoned that an aligner purpose-built for error correction could yield higher-quality barcode-sequence maps. Results: Here we present BCAR, a fast barcode-sequence mapper for correcting sequencing errors. BCAR considers all of the evidence for each base call at each position, both during alignment and during final consensus generation. BCAR creates high-accuracy barcode-sequence maps from simulated reads across a broad range of error rates and read lengths, outperforming existing methods. We apply BCAR to two experimental datasets, where it generates high-quality barcode-sequence maps. Availability and implementation: BCAR source code, documentation, and test data are available from https://github.com/dry-brews/BCAR
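The quality-aware consensus idea can be sketched as follows (this is a generic illustration in the spirit of the abstract, not BCAR's algorithm): for reads sharing a barcode, each base call votes with a weight derived from its Phred quality, so one confident read can outvote several low-quality ones.

```python
# Sketch of quality-weighted consensus generation: Phred quality Q
# corresponds to an error probability p = 10**(-Q/10); each base call
# votes with weight 1 - p, and the consensus takes the heaviest base
# at each position.

def consensus(reads, quals):
    """reads: equal-length strings; quals: matching lists of Phred scores."""
    out = []
    for i in range(len(reads[0])):
        weight = {}
        for read, qual in zip(reads, quals):
            p_err = 10 ** (-qual[i] / 10)
            weight[read[i]] = weight.get(read[i], 0.0) + (1 - p_err)
        out.append(max(weight, key=weight.get))
    return "".join(out)
```

Note the second test below: a single Q40 call outweighs two Q3 calls, which a plain majority vote would get wrong; this is what "fully utilizing quality scores" buys.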
Magalhaes, H.; Weber, J.; Klau, G. W.; Marschall, T.; Prodanov, T.
Variation in sequence copy number (CN) between individuals can be associated with phenotypic differences. Consequently, CN calling is an important step for disease association and identification, as well as for genome assembly validation. Traditionally, CN calling is done by mapping sequencing reads to a linear reference genome and estimating the CN from the observed read depth. This approach, however, is significantly hampered by sequences and rearrangements not present in a linear reference genome; at the same time, simple CN prediction for individual graph nodes does not make use of the graph topology and can lead to inconsistent results. To address these issues, we propose Floco, a method for CN calling with respect to a genome graph using a network flow formulation. Given a graph and alignments against that graph, we calculate raw CN probabilities for every graph node based on the Negative Binomial distribution and the base-pair coverage across the node, and then use integer linear programming to compute the CN flow through the whole graph. We tested this approach on 15 aligned datasets involving three different graphs, as well as HiFi and ONT sequencing reads and linear assemblies split into reads. The results demonstrate that adding the network flow formulation increases the accuracy of CN predictions by up to 43% compared with read-depth-based estimation alone. Additionally, concordance between predictions from the three different sequence sources reached 93.2%. Floco fills a gap in CN calling tools specifically designed for genome graphs.
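The per-node scoring step described above can be sketched independently of the flow formulation (dispersion, depths, and CN range below are illustrative choices, not Floco's parameters; the graph-wide ILP reconciliation is omitted):

```python
# Sketch: score candidate copy numbers for one graph node by a
# Negative Binomial likelihood of its observed coverage, with
# mean = CN * haploid depth.
import math

def nb_logpmf(k, mu, r=10):
    """Negative Binomial log-pmf with mean mu and integer dispersion r."""
    if mu <= 0:
        return 0.0 if k == 0 else float("-inf")
    p = r / (r + mu)
    return (math.log(math.comb(k + r - 1, k))
            + r * math.log(p) + k * math.log(1 - p))

def best_cn(coverage, haploid_depth=15, max_cn=6):
    scores = {cn: nb_logpmf(coverage, cn * haploid_depth)
              for cn in range(max_cn + 1)}
    return max(scores, key=scores.get)
```

Per-node maximum-likelihood calls like `best_cn` are what the paper describes as "raw CN probabilities"; the network flow step then enforces consistency of these calls across adjacent nodes.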
Qorri, E.; Varga, V.; Priskin, K.; Latinovics, D.; Takacs, B.; Pekker, E.; Jaksa, G.; Csanyi, B.; Torday, L.; Bassam, A.; Kahan, Z.; Pinter, L.; Haracska, L.
Background: Circular RNAs (circRNAs) have emerged as promising non-invasive cancer biomarkers due to their stability, abundance in body fluids, and regulatory potential. However, circRNA differential expression analysis (DEA) remains challenging, largely owing to a lack of consensus on important preprocessing strategies such as filtering and normalization. While well-established bulk RNA-sequencing frameworks are commonly applied to circRNA data, newer approaches such as CIRI-DE (part of the CIRI3 suite) integrate both linear and circular transcript information to improve detection. Despite these developments, an assessment of such integrative strategies is lacking, and the critical impact of filtering on DEA model performance has not been comprehensively evaluated. Results: In this study, we evaluated the impact of multiple normalization and filtering strategies on circRNA DEA using five experimental datasets, including two in-house blood platelet sets, plus semi-parametric simulated in silico datasets. Our results emphasize the importance of selecting an appropriate filtering threshold, as overly lenient filtering substantially reduced model performance across datasets. We found edgeR's filterByExpr() strategy particularly effective at handling zero counts in circRNA data, while also generating the most reliable results across most datasets. Furthermore, by incorporating linear and circular information as described in CIRI-DE, most methods identified a higher number of differentially expressed (DE) circRNAs compared to circular counts alone. Notably, circRNAs identified by both CIRI-DE and the modified bulk RNA-sequencing pipelines showed substantial overlap. Conclusion: Our findings demonstrate that automated filtering combined with linear-aware normalization significantly enhances the sensitivity and reproducibility of circRNA DEA, providing a standardized framework for more reliable biomarker discovery in transcriptomic research.
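The kind of count-based filtering at issue can be sketched with a simple rule, loosely in the spirit of edgeR's filterByExpr idea but not its algorithm (the thresholds and toy counts below are invented):

```python
# Sketch of expression filtering for sparse count data: keep a circRNA
# only if it reaches a minimum count in at least as many samples as
# the smallest experimental group, so that zero-heavy rows are dropped
# before differential expression testing.

def filter_counts(counts, group_sizes, min_count=10):
    """counts: {circRNA id: [per-sample counts]}; returns kept ids."""
    min_group = min(group_sizes)
    kept = []
    for circ_id, row in counts.items():
        if sum(c >= min_count for c in row) >= min_group:
            kept.append(circ_id)
    return kept
```

Loosening `min_count` toward zero admits rows dominated by zeros, which is exactly the "overly lenient filtering" the study found to degrade DEA performance.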
Gorin, G.; Guruge, D.; Goodman, L.
Rigorous experimental design, including formal power analysis, is a cornerstone of reproducible RNA sequencing (RNA-seq) research. Designing an RNA-seq experiment requires computing the minimum number of samples needed to identify an effect of a particular size at a predefined significance level. Ideally, the statistical test used for analyzing the experimental data should match the test used for sample size determination; however, few tools adopt the assumptions of the popular differential expression testing framework DESeq2, and most opt for simulation-based rather than analytical approaches. Grounded in the DESeq2 model framework, we derive sample size requirements for both single-cell and bulk RNA-seq experiments, delivered as DEPower, a web-based power-analysis tool (https://poweranalysis-fb.streamlit.app/) that makes rigorous RNA-seq study design accessible to all researchers.
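To make the notion of analytical sample-size determination concrete, here is a back-of-the-envelope sketch using the classic two-sample normal approximation; this is a generic z-test formula, not the DESeq2-based derivation the tool implements:

```python
# Sketch of analytical power analysis: smallest n per group so that a
# two-sided z-test detects standardized effect size d at significance
# alpha with power 1 - beta, via n = 2 * (z_{1-a/2} + z_{power})^2 / d^2.
import math

def z_quantile(p):
    """Standard normal quantile via bisection on the erf-based CDF."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def n_per_group(d, alpha=0.05, power=0.8):
    za = z_quantile(1 - alpha / 2)
    zb = z_quantile(power)
    return math.ceil(2 * (za + zb) ** 2 / d ** 2)
```

Count-based frameworks replace the normal model with a negative binomial one, but the structure (effect size, alpha, power in; minimum n out) is the same.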
Bercovich Szulmajster, U.; Wiuf, C.; Albrechtsen, A.
Linkage disequilibrium (LD) is a central statistic in population genetic studies, commonly measured by the squared correlation between pairs of genetic variants. An important drawback of this measure is its upward bias under finite sample sizes. Different methods exist that correct for this sample-size bias; however, because the correlation is a ratio, no unbiased estimator exists. In this work, we present a procedure to calibrate these methods using a non-parametric approach with simulated data. We use forward modeling to generate genotype matrices with known parameters, followed by an inverse mapping to recover estimates of the underlying parameters; a mean-centering calibration is then applied to the recovered estimate of the true parameter. Applied to real and simulated data, this approach shows consistent improvement in accuracy compared to other sample-size-aware methods. Furthermore, to study effects on downstream analyses, we analyze classification performance in LD pruning, where we also observe an improvement, particularly in extreme cases with low sample sizes of 5 or 10 individuals.
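The upward bias and the forward-simulation idea can both be demonstrated in a few lines (the simulation below uses independent Gaussian variables rather than genotypes, purely as an illustration): for independent variants the true r² is 0, yet the sample r² has expectation close to 1/(n-1), and simulating with known parameters is what reveals the bias a mean-centering calibration can then remove.

```python
# Sketch: estimate the finite-sample bias of r^2 by forward simulation
# under a known truth (independence, so true r^2 = 0).
import random

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

def mean_bias(n, reps=1000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]   # independent pairs:
        y = [rng.gauss(0, 1) for _ in range(n)]   # true r^2 is zero
        total += r_squared(x, y)
    return total / reps
```

For n = 10 the mean sample r² comes out near 1/9 ≈ 0.11 despite a true value of 0, which is the kind of systematic offset the paper's calibration targets.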
Schilder, B. M.; Skene, N. G.; Murphy, A. E.
Motivation: Mapping genes across identifier systems and species is a routine but critical step in bioinformatics workflows. Despite its ubiquity, gene mapping is frequently handled with bespoke, ad hoc solutions, duplicating effort and introducing opportunities for error. These issues are exacerbated by the prevalence of non-one-to-one homolog relationships and inconsistent handling of gene identifiers across species and databases, which can compromise downstream analyses and reproducibility. Results: We present orthogene, an R/Bioconductor package that simplifies gene mapping within and across hundreds of species. orthogene provides a unified, workflow-oriented framework that integrates automated species and identifier standardization, homolog inference across multiple databases, flexible handling of ambiguous homolog relationships, and transformation of gene lists, tables, and high-dimensional matrices into analysis-ready formats. By abstracting common sources of technical complexity while retaining user control, orthogene enables transparent, reproducible, and scalable gene mapping across a wide range of biological contexts. Availability: https://bioconductor.org/packages/orthogene Contact: brian_schilder@alumni.brown.edu
Raffaelli, G. T.; Kislinger, J.; Kroupa, T.; Hlinka, J.
Background and objective: Quantifying higher-order statistical dependencies in multivariate biomedical data is essential for understanding collective dynamics in complex systems such as neuronal populations. The connected information framework provides a principled decomposition of the total information content into contributions from interactions of increasing order. However, its application has been limited by the computational complexity of conventional maximum entropy formulations. In this work, we present a generalised formulation of connected information based on maximum entropy problems constrained by entropic quantities. Methods: The entropic-constraint approach, in contrast to the original constraints based on marginals or moments, transforms the original nonconvex optimisation into a tractable linear program defined over polymatroid cones. This simplification enables efficient, robust estimation even under undersampling. Results: We present the theoretical foundations, algorithmic implementation, and validation through numerical experiments and real-world data. Applications to symbolic sequences, large-scale neuronal recordings, and DNA sequences demonstrate that the proposed method accurately detects higher-order interactions and remains stable even with limited data. Conclusions: The accompanying open-source software library, HORDCOIN (Higher ORDer COnnected INformation), provides user-friendly tools for computing connected information using both marginal- and entropy-based formulations. Overall, this work bridges the gap between abstract information-theoretic measures and practical biomedical data analysis, enabling scalable investigation of higher-order dependencies in neurophysiological and other complex biological systems such as the genome.
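The lowest rung of such a decomposition is easy to compute directly: total correlation (multi-information), the difference between the sum of marginal entropies and the joint entropy, here estimated naively from symbol counts. This is only an illustration of the quantities involved; the full connected-information hierarchy and its linear-program formulation are beyond this sketch.

```python
# Sketch: total correlation TC(X) = sum_i H(X_i) - H(X), estimated
# from empirical counts. TC is zero iff the variables are independent
# and grows with statistical dependence of any order.
import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def total_correlation(samples):
    """samples: list of equal-length tuples of symbols."""
    k = len(samples[0])
    marginals = sum(entropy(Counter(s[i] for s in samples)) for i in range(k))
    joint = entropy(Counter(samples))
    return marginals - joint
```

Two perfectly coupled binary variables carry exactly 1 bit of total correlation, while independent ones carry none; connected information further splits such totals into pairwise, triplet, and higher-order contributions.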
Mulaudzi, S.; Kulkarni, S.; Marin, M. G.; Farhat, M. R.
Background: Low-frequency (minority) variants, i.e. variants detectable within a sample at low allele frequencies, are relevant in several areas of research and health, from cancer to pathogen heteroresistance. There is uncertainty about the optimal bioinformatic approach for accurately and reproducibly distinguishing low-frequency variants from sequencing or mapping error. To address this, we benchmarked seven variant callers on precision, recall, and false-positive characteristics for detecting low-frequency variants, using simulated short-read whole-genome sequencing data for 700 Mycobacterium tuberculosis strains. We also developed a new low-frequency error model, based on read mapping and quality metrics, for filtering the output of the best-performing tool. Results: We simulated 378 unique variants across 5 genomic backgrounds spanning 4 lineages. Variants were simulated to represent 3 genomic region categories, 10 allele frequencies, and 5 sequencing depths. FreeBayes, a haplotype-based variant caller, achieved the highest pooled F1 score of the seven tools in drug resistance regions (average F1 = 0.86), and its higher performance held across genomic context and background. Across tools, we identified lower performance in repetitive (low-mappability) regions and strong reference bias in low-frequency variant calling. We validated variant caller performance on a sample of in vitro strain mixtures, substantiating our ranking. When paired with FreeBayes, the error model excludes 49% of false variants and <1% of true variants. Conclusions: Our analysis provides evidence to support best practices for low-frequency variant calling, including tool choice, masking, and filtering. We also provide a new error model that excludes false-positive low-frequency variant calls from FreeBayes output.
Warr, M. J.; Dinh, T.; Root, B.; Onstott, E.; Yu, K.; Mudge, J.; Ramaraj, T.; Kahanda, I.; Mumey, B.
In this work, we investigate using motif subsequence features to predict whether a genomic region is accessible to regulatory proteins, i.e. an accessible chromatin region (ACR), enabling transcription of associated genes. We focus on plants, whose agricultural and ecological importance make them interesting and important organisms to study, and whose complex genomes provide important stress tests for our algorithm. We show that motif sequence similarity as found by co-linear chaining can be used in combination with machine learning models to effectively predict ACRs in genome assemblies.
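The co-linear chaining mentioned above can be sketched in its simplest form (an O(n²) dynamic program over anchor pairs; real chainers add gap costs and faster data structures, and the toy anchors below are invented):

```python
# Sketch of co-linear chaining: given motif hits as (query position,
# target position) anchors, find the longest chain in which BOTH
# coordinates strictly increase -- a 2-D longest-increasing-subsequence.

def chain(anchors):
    """anchors: list of (qpos, tpos); returns the length of the best chain."""
    anchors = sorted(anchors)
    best = [1] * len(anchors)
    for i, (qi, ti) in enumerate(anchors):
        for j in range(i):
            qj, tj = anchors[j]
            if qj < qi and tj < ti:        # co-linear: both increase
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)
```

A long chain of motif hits in consistent order is evidence that two regions share motif architecture, which is the similarity signal fed into the machine learning models for ACR prediction.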
Bleker, C.; Zagorscak, M.; Blejec, A.; Gruden, K.; Zupanic, A.
Summary: Boolean and logic-based modeling approaches are well suited to the analysis of complex biological systems, particularly when detailed biochemical and kinetic information is unavailable. In such settings, biological pathways are represented as networks capturing system components and their interactions, providing a simplified yet informative abstraction of system behavior. While the structural topology of these networks is often well characterized, the absence of mechanistic detail limits the applicability of parameter-dependent modeling frameworks. To address this, we present BoolDog, a Python package for the construction, simulation, and analysis of Boolean and semi-quantitative Boolean networks. BoolDog supports synchronous simulation with events, attractor and steady-state identification, network visualization, and the systematic transformation of logic-based models into continuous ordinary differential equation (ODE) systems, enabling the seamless integration of discrete and continuous modeling paradigms. Networks can be imported and exported across standard formats, and BoolDog integrates natively with established Python libraries for network analysis and visualisation, including NetworkX, igraph, and py4Cytoscape. Together, these capabilities provide a flexible, accessible, and interoperable platform for logic-based modeling of complex biological systems. Availability and implementation: BoolDog is implemented in Python and available at https://github.com/NIB-SI/BoolDog/.
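The two core operations named in this summary, synchronous simulation and attractor identification, fit in a short sketch (the toy 3-node network and update rules below are invented for illustration, not a BoolDog model or its API):

```python
# Sketch of a synchronous Boolean network: all nodes update
# simultaneously from the previous state; repeating states mark an
# attractor (fixed point or cycle).

RULES = {
    "A": lambda s: s["C"],                  # A copies C
    "B": lambda s: s["A"] and not s["C"],
    "C": lambda s: not s["B"],
}

def step(state):
    """One synchronous update: every rule reads the OLD state."""
    return {node: rule(state) for node, rule in RULES.items()}

def find_attractor(state, max_steps=50):
    seen = []
    for _ in range(max_steps):
        if state in seen:                   # revisited: attractor found
            return seen[seen.index(state):]
        seen.append(state)
        state = step(state)
    return []
```

From the all-False state this toy network settles into a single fixed point; semi-quantitative extensions replace the 0/1 node values with continuous activity levels governed by analogous logic.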
Kalra, A.; Paulin, L.; Sedlazeck, F.
Background: Accurate discrimination of true structural variants (SVs) from artifacts in long-read sequencing data remains a critical bottleneck. Numerous machine learning solutions have been proposed, ranging from classical models using engineered features to advanced deep learning and foundation-model interpretability methods. However, a systematic comparison of their performance, efficiency, and practical utility is lacking. Results: We conducted a comprehensive benchmark of five machine learning paradigms for SV filtering using standardized Genome in a Bottle (GIAB) data for samples HG002 and HG005. We evaluated classical Random Forest classifiers on 15 genomic features, computer vision models (ResNet/VICReg), diffusion-based anomaly detection, sparse autoencoders (SAEs) on the Evo2-7B foundation model, and multimodal ensembles. A simple Random Forest on interpretable features achieved a peak F1-score of 95.7%, effectively matching all more complex models (ResNet50: 95.9%, diffusion: 95.8%). This study represents the first application of diffusion-based anomaly detection and sparse autoencoders to structural variant analysis; while diffusion models learned highly discriminative, disentangled representations and SAEs uncovered biologically interpretable features (including atoms specific to Alu deletions, chromosome X variants, and insertion events), they did not significantly surpass this classification ceiling. Ensemble methods offered no performance benefit but may have future potential given the orthogonality of vision-based and linear features. Conclusions: Our findings demonstrate that for the established task of germline SV filtering, simpler, interpretable models provide an optimal balance of accuracy, speed, and transparency. This benchmark establishes a pragmatic framework for method selection and argues that increased model complexity must be justified by clear, unmet biological needs rather than marginal predictive gains.
Terra, R.; Carvalho, D.; Machado, D. J.; Osthoff, C.; Ocana, K.
Advances in High-Performance Computing (HPC) have enabled increasingly complex genomic analyses, including phylogenomics. These analyses contribute to understanding the evolution of viruses and pathogens, improving our knowledge of disease transmission and supporting targeted public health strategies. However, given the growing number of tools and processing steps involved, executing these analyses manually, step by step, becomes error-prone and inefficient. To address this challenge, we present HP2NET, a robust framework for reproducible, efficient, and scalable phylogenetic network analysis. HP2NET integrates five workflows based on state-of-the-art tools such as PhyloNetworks and PhyloNet, allowing multiple datasets and workflows to be analyzed in a single execution. The framework includes features such as task packaging and data reuse to improve performance and resource utilization in HPC environments. We performed a comprehensive performance evaluation of the software used within HP2NET, identifying bottlenecks and analyzing the gains from parallel processing. In our experimental environment, data reuse reduced runtime by up to 15.35% for a small dataset, while parallel execution of the five pipelines reduced total runtime by up to 90.96% compared to sequential runs. Finally, we validate HP2NET in a real-world case study analyzing Dengue virus genomes, demonstrating its value for large-scale phylogenetic analyses.